Abstract. We consider a general supervised learning problem with strongly
convex and Lipschitz loss and study the problem of model selection aggregation.
In particular, given a finite dictionary of functions (learners) together
with a prior, we generalize the results obtained by Dai, Rigollet and
Zhang (2012) for Gaussian regression with squared loss and fixed design
to this learning setup. Specifically, we prove that the Q-aggregation procedure
outputs an estimator that satisfies optimal oracle inequalities both
in expectation and with high probability. Our proof techniques somewhat
depart from traditional proofs by making most of the standard arguments
on the Laplace transform of the empirical process to be controlled.

AMS 2000 subject classifications: Primary 62H25; secondary 62F04, 90C22. 
1. INTRODUCTION AND MAIN RESULTS

Let $\mathcal{X}$ be a probability space and let $(X, Y) \in \mathcal{X} \times \mathbb{R}$ be a random couple.
Broadly speaking, the goal of statistical learning is to predict $Y$ given $X$. To
achieve this goal, we observe a dataset $\mathcal{D} = \{(X_1, Y_1), \ldots, (X_n, Y_n)\}$ that consists
of $n$ independent copies of $(X, Y)$ and use these observations to construct a
function (learner) $f : \mathcal{X} \to \mathbb{R}$ such that $f(X)$ is close to $Y$ in a certain sense.
More precisely, the prediction quality of a (possibly data dependent) function
$\hat f$ is measured by a risk function $R : \mathbb{R}^{\mathcal{X}} \to \mathbb{R}$ associated to a loss function
$\ell : \mathbb{R}^2 \to \mathbb{R}$ in the following way

$$ R(\hat f) = \mathbb{E}\big[\ell(Y, \hat f(X)) \,\big|\, \mathcal{D}\big]. $$

We focus hereafter on loss functions $\ell$ that are convex in their second argument.
Moreover, for the sake of simplicity, throughout this article we restrict ourselves
to functions $f$ and random variables $(X, Y)$ for which $|Y| \le b$ and $|f(X)| \le b$
almost surely, for some fixed $b > 0$. For any real-valued measurable $f$ on $\mathcal{X}$ for
which this quantity is finite, we define $\|f\|_2 = \sqrt{\mathbb{E}[f(X)^2]}$.

We are given a finite set $\mathcal{F} = \{f_1, \ldots, f_M\}$ of measurable functions from $\mathcal{X}$ to
$\mathbb{R}$. This set is called a dictionary. The elements in $\mathcal{F}$ may have been constructed
using an independent, frozen, dataset at some previous step or may simply be
good candidates for the learning task at hand. To focus our contribution on the
aggregation problem, we restrict our attention to the case where $\mathcal{F}$ consists of
deterministic functions. The aim of model selection aggregation [27, 7, 8, 31] is to
use the data $\mathcal{D}$ to construct a function $\hat f$ having an excess risk $R(\hat f) - \min_{f \in \mathcal{F}} R(f)$
as small as possible. Namely, we seek the smallest deterministic residual term
$\Delta_n(\mathcal{F}) > 0$ such that the excess risk is bounded above by $\Delta_n(\mathcal{F})$, either in
expectation or with high probability, or, in this instance, in both. In the high
probability case, such bounds are called oracle inequalities. This problem was
studied for instance in [2, 3, 6, 7, 14, 27, 18, 19, 23, 31, 32, 33, 34].

From a minimax standpoint, it has been proved that $\Delta_n(\mathcal{F}) = C(\log M)/n$,
$C > 0$, is the smallest residual term that one can hope for in the regression problem
with quadratic loss [31]. An estimator $\hat f$ achieving such a rate (up to some
multiplying constant) is called an optimal aggregate. The aim of this paper is to
construct optimal aggregates under general conditions on the loss function $\ell$.

Note that the optimal residuals for model selection aggregation are of the order
$1/n$ as opposed to the standard parametric rate $1/\sqrt{n}$. This fast rate essentially
comes from the strong convexity of the quadratic loss. In what follows we show
that indeed, strong convexity is sufficient to obtain fast rates. It is known that
rates of optimal order $1/n$ cannot be achieved if the loss function is only assumed
to be convex. Indeed, it follows from [21], Theorem 2 that if the loss is linear then
the best achievable residual term is at least of the order $\sqrt{(\log |\mathcal{F}|)/n}$. Recall that
a function $g$ is said to be strongly convex on a nonempty convex set $C \subset \mathbb{R}$ if
there exists a constant $c$ such that

$$ g(\alpha a + (1-\alpha) a') \le \alpha g(a) + (1-\alpha) g(a') - \frac{c}{2}\,\alpha(1-\alpha)(a - a')^2 $$

for any $a, a' \in C$, $\alpha \in (0,1)$. In this case, $c$ is called the modulus of strong
convexity. For technical reasons, we will also need to assume that the loss function is
Lipschitz. We now introduce the set of assumptions that are sufficient for our
approach.

Assumption 1. The loss function $\ell$ is such that for any $f, g \in [-b, b]$, we
have

$$ |\ell(Y, f) - \ell(Y, g)| \le C_b\, |f - g|, \quad \text{a.s.} $$

Moreover, almost surely, the function $\ell(Y, \cdot)$ is strongly convex with modulus of
strong convexity $C_\ell$ on $[-b, b]$.

A central quantity that is used for the construction of aggregates is the empirical
risk defined by

$$ R_n(f) = \frac{1}{n} \sum_{i=1}^n \ell(Y_i, f(X_i)) \tag{1.1} $$
for any real-valued function $f$ defined over $\mathcal{X}$. A natural aggregation procedure
consists in taking the function in $\mathcal{F}$ that minimizes the empirical risk. This procedure
is called empirical risk minimization (ERM). It has been proved that ERM
is suboptimal for the aggregation problem [19, 7, 24, 22, 26, 30]. Somehow, this
procedure does not take advantage of the convexity of the loss since the class of
functions on which the empirical risk is minimized to construct the ERM is $\mathcal{F}$, a
finite set. As it turns out, the performance of ERM relies critically on the convexity
of the class of functions on which the empirical risk is minimized [26, 24].
Therefore, a natural idea is to "improve the geometry" of $\mathcal{F}$ by taking its convex
hull $\mathrm{conv}(\mathcal{F})$ and then by minimizing the empirical risk over it. However, this
procedure is also suboptimal [23, 9]. The weak point of this procedure lies in the
metric complexity of the problem: taking the convex hull of $\mathcal{F}$ indeed "improves
the geometry" of $\mathcal{F}$ but it also increases its complexity by too much. The complexity
of the convex hull of a set can be much larger than the complexity of the set
itself and this leads to a failure of this naive convexification trick. Nevertheless, a
compromise between geometry and complexity was struck in [2] and [23] where
optimal aggregates have been successfully constructed. In [2], this improvement
is achieved by minimizing the empirical risk over a carefully chosen star-shaped
subset of the convex hull of $\mathcal{F}$. In [23], a better geometry was achieved by taking
the convex hull of an appropriate subset of $\mathcal{F}$ and then by minimizing the
empirical risk over it.

In this paper, we show that a third procedure, called Q-aggregation, which
was introduced in [28, 9] for fixed design Gaussian regression, also leads to optimal
rates of aggregation. Unlike the above two procedures that rely on finding an
appropriate constraint for ERM, Q-aggregation is based on a penalization of the
empirical risk while the constraint set is kept to be the convex hull of $\mathcal{F}$. Let $\Theta$
denote the flat simplex of $\mathbb{R}^M$ defined by

$$ \Theta = \Big\{(\theta_1, \ldots, \theta_M) \in \mathbb{R}^M : \theta_j \ge 0, \ \sum_{j=1}^M \theta_j = 1\Big\} $$

and for any $\theta \in \Theta$, define the convex combination $f_\theta = \sum_{j=1}^M \theta_j f_j$. For any fixed
$\nu$, the Q-functional is defined for any $\theta \in \Theta$ by

$$ Q(\theta) = (1 - \nu) R_n(f_\theta) + \nu \sum_{j=1}^M \theta_j R_n(f_j). \tag{1.2} $$

We keep the terminology Q-aggregation from [9] on purpose. Indeed, Q stands for
quadratic and, while we do not employ a quadratic loss, we exploit strong convexity
in the same manner as in [9] and [28]. Indeed the first term in $Q$ acts as a
regularization of the linear interpolation of the empirical risk and is therefore a
strongly convex regularization.
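
As an illustration of (1.2), the following minimal sketch evaluates the Q-functional on a toy sample. The squared loss, the two constant learners and all function names are assumptions made for the example only; the paper itself covers general strongly convex Lipschitz losses.

```python
from typing import Callable, Sequence, Tuple

Learner = Callable[[float], float]
Sample = Sequence[Tuple[float, float]]  # pairs (x_i, y_i)


def squared_loss(y: float, a: float) -> float:
    """Illustrative loss (an assumption); strongly convex in its second argument."""
    return (y - a) ** 2


def empirical_risk(f: Learner, data: Sample, loss=squared_loss) -> float:
    """R_n(f) = (1/n) * sum_i loss(Y_i, f(X_i))."""
    return sum(loss(y, f(x)) for x, y in data) / len(data)


def convex_combination(theta: Sequence[float], dictionary: Sequence[Learner]) -> Learner:
    """f_theta = sum_j theta_j * f_j."""
    return lambda x: sum(t * f(x) for t, f in zip(theta, dictionary))


def q_functional(theta, dictionary, data, nu, loss=squared_loss) -> float:
    """Q(theta) = (1 - nu) * R_n(f_theta) + nu * sum_j theta_j * R_n(f_j)."""
    risk_of_mixture = empirical_risk(convex_combination(theta, dictionary), data, loss)
    mixture_of_risks = sum(
        t * empirical_risk(f, data, loss) for t, f in zip(theta, dictionary)
    )
    return (1.0 - nu) * risk_of_mixture + nu * mixture_of_risks


# Toy dictionary of two constant learners and a two-point sample (hypothetical data).
dictionary = [lambda x: 0.0, lambda x: 1.0]
data = [(0.0, 0.0), (1.0, 1.0)]
q_mid = q_functional([0.5, 0.5], dictionary, data, nu=0.5)
q_vertex = q_functional([1.0, 0.0], dictionary, data, nu=0.5)
```

At a vertex of the simplex the two terms coincide, so $Q(e_j) = R_n(f_j)$ whatever the value of $\nu$; in the interior, the first term rewards mixtures whose predictions are better than the average of the individual risks.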

We consider the following aggregation procedure. Unlike the procedures introduced
in [2, 23], the Q-aggregation procedure allows us to put a prior weight given
by a prior probability $\pi = (\pi_1, \ldots, \pi_M)$ on each element of the dictionary $\mathcal{F}$. This
feature turns out to be crucial for applications [1, 10, 11, 13, 14, 15, 12, 16, 29, 30].
Let $\beta > 0$ be the temperature parameter and $0 < \nu < 1$. Consider any vector of
weights $\hat\theta \in \Theta$ defined by

$$ \hat\theta \in \operatorname*{argmin}_{\theta \in \Theta} \Big[ (1 - \nu) R_n(f_\theta) + \nu \sum_{j=1}^M \theta_j R_n(f_j) + \frac{\beta}{n} \sum_{j=1}^M \theta_j \log\Big(\frac{1}{\pi_j}\Big) \Big]. \tag{1.3} $$
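
The paper does not prescribe a numerical method for solving (1.3). As a hedged sketch only, one may minimize the penalized Q-functional over the simplex with an exponentiated-gradient scheme; the squared loss, the step size `eta`, the number of steps and the toy data below are all assumptions of the example, not part of the original procedure.

```python
import math
from typing import List, Sequence


def penalized_q(theta, preds, y, prior, nu, beta) -> float:
    """Objective of (1.3) for the squared loss; preds[j][i] stands for f_j(X_i)."""
    n = len(y)
    mix = [sum(t * p[i] for t, p in zip(theta, preds)) for i in range(n)]
    risk_mix = sum((yi - mi) ** 2 for yi, mi in zip(y, mix)) / n
    risks = [sum((yi - pi) ** 2 for yi, pi in zip(y, p)) / n for p in preds]
    linear = sum(
        t * (nu * r + (beta / n) * math.log(1.0 / pj))
        for t, r, pj in zip(theta, risks, prior)
    )
    return (1.0 - nu) * risk_mix + linear


def q_aggregate(preds: Sequence[Sequence[float]], y: Sequence[float],
                prior: Sequence[float], nu: float, beta: float,
                steps: int = 300, eta: float = 0.5) -> List[float]:
    """Exponentiated-gradient sketch; returns the best iterate found on the simplex."""
    n, m = len(y), len(preds)
    risks = [sum((yi - pi) ** 2 for yi, pi in zip(y, p)) / n for p in preds]
    theta = [1.0 / m] * m
    best, best_val = list(theta), penalized_q(theta, preds, y, prior, nu, beta)
    for _ in range(steps):
        mix = [sum(t * p[i] for t, p in zip(theta, preds)) for i in range(n)]
        grad = [
            (1.0 - nu) * (-2.0 / n) * sum((yi - mi) * pi for yi, mi, pi in zip(y, mix, p))
            + nu * r + (beta / n) * math.log(1.0 / pj)
            for p, r, pj in zip(preds, risks, prior)
        ]
        shift = min(grad)  # subtracting the minimum keeps every exponent <= 0
        weights = [t * math.exp(-eta * (g - shift)) for t, g in zip(theta, grad)]
        total = sum(weights)
        theta = [w / total for w in weights]
        val = penalized_q(theta, preds, y, prior, nu, beta)
        if val < best_val:
            best, best_val = list(theta), val
    return best


# Hypothetical toy problem: the third learner reproduces the responses exactly.
preds = [[0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0], [0.5, 0.4, 0.6, 0.5]]
y = [0.5, 0.4, 0.6, 0.5]
prior = [0.25, 0.25, 0.5]
theta_hat = q_aggregate(preds, y, prior, nu=0.5, beta=0.1)
```

The multiplicative update keeps every iterate inside the simplex, so no projection step is needed. Any convex solver could be substituted, since the objective is convex in $\theta$ when the loss is convex.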



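It comes out of our analysis that $f_{\hat\theta}$ achieves an optimal rate of aggregation if $\beta$
satisfies

$$ \beta > \max\Big( \frac{12\, C_b^2 (1 - \nu)^2}{\mu}, \ 6\sqrt{3}\, b\, C_b (1 - \nu), \ \frac{3\, C_b\, \nu (\nu C_b + 4 \mu b)}{2\mu} \Big), \tag{1.4} $$

where $\mu = \min(\nu, 1 - \nu)\, C_\ell / 10$.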



Theorem A. Let $\mathcal{F}$ be a finite dictionary of cardinality $M$ and $(X, Y)$ be a
random couple of $\mathcal{X} \times \mathbb{R}$ such that $|Y| \le b$ and $\max_{f \in \mathcal{F}} |f(X)| \le b$ a.s. for some
$b > 0$. Assume that Assumption 1 holds and that $\beta$ satisfies (1.4). Then, for any
$x > 0$, with probability greater than $1 - \exp(-x)$,

$$ R(f_{\hat\theta}) \le \min_{j=1,\ldots,M} \Big[ R(f_j) + \frac{\beta}{n} \log\Big(\frac{1}{\pi_j}\Big) \Big] + \frac{2\beta x}{n}. $$

Moreover,

$$ \mathbb{E}\big[R(f_{\hat\theta})\big] \le \min_{j=1,\ldots,M} \Big[ R(f_j) + \frac{\beta}{n} \log\Big(\frac{1}{\pi_j}\Big) \Big]. $$

If $\pi$ is the uniform distribution, that is $\pi_j = 1/M$ for all $j = 1, \ldots, M$, then
we recover in Theorem A the classical optimal rate of aggregation $(\log M)/n$ and
the estimator $\hat\theta$ is just the one minimizing the Q-functional defined in (1.2). In
particular no temperature parameter is needed for its construction. As a result,
in this case, the parameter $b$ need not be known for the construction of the
Q-aggregation procedure.
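
To give a sense of the magnitudes involved, the sketch below evaluates the lower bound on $\beta$ from (1.4), read here as the maximum of $12 C_b^2 (1-\nu)^2/\mu$, $6\sqrt{3}\, b\, C_b (1-\nu)$ and $3 C_b \nu(\nu C_b + 4\mu b)/(2\mu)$ with $\mu = \min(\nu, 1-\nu) C_\ell/10$, together with the residual term of Theorem A. The function names are hypothetical, and the constants $C_b = 4b$, $C_\ell = 2$ are the ones the squared loss has on $[-b, b]$ when $|Y| \le b$; this choice of loss is an assumption of the example.

```python
import math


def mu_from(nu: float, c_ell: float) -> float:
    """mu = min(nu, 1 - nu) * C_ell / 10."""
    return min(nu, 1.0 - nu) * c_ell / 10.0


def beta_threshold(b: float, c_b: float, c_ell: float, nu: float) -> float:
    """Right-hand side of condition (1.4); beta must exceed this value."""
    mu = mu_from(nu, c_ell)
    term_symmetrization = 12.0 * c_b ** 2 * (1.0 - nu) ** 2 / mu
    term_bernstein = 6.0 * math.sqrt(3.0) * b * c_b * (1.0 - nu)
    term_linear_part = 3.0 * c_b * nu * (nu * c_b + 4.0 * mu * b) / (2.0 * mu)
    return max(term_symmetrization, term_bernstein, term_linear_part)


def residual(beta: float, n: int, prior_mass: float, x: float) -> float:
    """(beta / n) * log(1 / pi_j) + 2 * beta * x / n from the deviation bound."""
    return (beta / n) * math.log(1.0 / prior_mass) + 2.0 * beta * x / n


# Squared loss with b = 1: C_b = 4 * b and C_ell = 2 (assumed example constants).
threshold = beta_threshold(b=1.0, c_b=4.0, c_ell=2.0, nu=0.5)
# Uniform prior over M = 100 learners, n = 10_000 observations, confidence x = 3.
delta = residual(beta=threshold, n=10_000, prior_mass=1.0 / 100, x=3.0)
```

With these constants the first term of the maximum dominates. The resulting residual decays like $1/n$, in line with the fast rate discussed above, although the numerical constants are large.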

2. PRELIMINARIES TO THE PROOF OF THEOREM A 

An important part of our analysis is based upon concentration properties of
empirical processes. While our proofs are similar to those employed in [28] and [9],
they contain genuinely new arguments. In particular, this learning setting, unlike
the denoising setting considered in [28, 9], allows us to employ various new tools
such as symmetrization and contraction. A classical tool to quantify the concentration
of measure phenomenon is given by Bernstein's inequality for bounded
variables. In terms of Laplace transform, Bernstein's inequality [5, Theorem 1.10]
states that if $Z_1, \ldots, Z_n$ are $n$ i.i.d. real-valued random variables such that for all
$i = 1, \ldots, n$,

$$ |Z_i| \le c \ \text{ a.s.} \quad \text{and} \quad \mathbb{E} Z_i^2 \le v, $$

then for any $0 < \lambda < 1/c$,

$$ \mathbb{E} \exp\Big[ \lambda \sum_{i=1}^n \{Z_i - \mathbb{E} Z_i\} \Big] \le \exp\Big( \frac{n v \lambda^2}{2(1 - c\lambda)} \Big). \tag{2.5} $$

Bernstein's inequality usually yields a bound of order $\sqrt{n}$ for the deviations of
a sum around its mean. As mentioned above, such bounds are not sufficient for
our purposes and we thus consider the following concentration result.
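Proposition 1. Let $Z_1, \ldots, Z_n$ be i.i.d. real-valued random variables and let
$c_0 > 0$. Assume that $|Z_1| \le c$ a.s. Then, for any $0 < \lambda < (2c_0)/(1 + 2c_0 c)$,

$$ \mathbb{E} \exp\Big[ n\lambda \Big( \frac{1}{n} \sum_{i=1}^n \{Z_i - \mathbb{E} Z_i\} - c_0\, \mathbb{E} Z_1^2 \Big) \Big] \le 1 $$

and

$$ \mathbb{E} \exp\Big[ n\lambda \Big( \frac{1}{n} \sum_{i=1}^n \{\mathbb{E} Z_i - Z_i\} - c_0\, \mathbb{E} Z_1^2 \Big) \Big] \le 1. $$

Proof. It follows from Bernstein's inequality (2.5) that for any $0 < \lambda <
(2c_0)/(1 + 2c_0 c)$,

$$ \mathbb{E} \exp\Big[ n\lambda \Big( \frac{1}{n} \sum_{i=1}^n \{Z_i - \mathbb{E} Z_i\} - c_0\, \mathbb{E} Z_1^2 \Big) \Big] \le \exp\Big( \frac{n\, \mathbb{E} Z_1^2\, \lambda^2}{2(1 - c\lambda)} \Big) \exp\big( -n \lambda c_0\, \mathbb{E} Z_1^2 \big) \le 1. $$

The second inequality is obtained by replacing $Z_i$ by $-Z_i$.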
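
Since the variables are i.i.d., the expectation in Proposition 1 factorizes into the $n$-th power of the single-variable quantity $\mathbb{E}\exp[\lambda(Z_1 - \mathbb{E}Z_1 - c_0 \mathbb{E}Z_1^2)]$, which can be computed exactly for a finitely supported variable. The sketch below checks both inequalities on a grid of admissible $\lambda$; the Bernoulli example and the function names are assumptions chosen for illustration.

```python
import math
from typing import Sequence


def lambda_max(c: float, c0: float) -> float:
    """Upper end of the admissible range 0 < lambda < 2*c0 / (1 + 2*c0*c)."""
    return 2.0 * c0 / (1.0 + 2.0 * c0 * c)


def single_factor(values: Sequence[float], probs: Sequence[float],
                  lam: float, c0: float, sign: float = 1.0) -> float:
    """E exp[lam * (sign * (Z - E Z) - c0 * E Z^2)] for a finitely supported Z.

    sign=+1 gives the first inequality of Proposition 1, sign=-1 the second.
    """
    mean = sum(p * z for p, z in zip(probs, values))
    second_moment = sum(p * z * z for p, z in zip(probs, values))
    return sum(
        p * math.exp(lam * (sign * (z - mean) - c0 * second_moment))
        for p, z in zip(probs, values)
    )


# Hypothetical example: Z is Bernoulli(0.3), so |Z| <= c with c = 1.
values, probs, c, c0 = [0.0, 1.0], [0.7, 0.3], 1.0, 0.5
grid = [lambda_max(c, c0) * k / 10.0 for k in range(1, 10)]
upper = [single_factor(values, probs, lam, c0, sign=1.0) for lam in grid]
lower = [single_factor(values, probs, lam, c0, sign=-1.0) for lam in grid]
n = 50
worst_case = max(max(upper), max(lower)) ** n  # value of the n-sample expectation
```

Each factor is at most 1, so raising it to the power $n$ cannot exceed 1 either, which is the content of the proposition for this particular distribution.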



We will also use the following exponential bound for Rademacher processes: let
$\varepsilon_1, \ldots, \varepsilon_n$ be independent Rademacher random variables and $a_1, \ldots, a_n$ be some
real numbers; then, by Hoeffding's inequality,

$$ \mathbb{E} \exp\Big( \sum_{i=1}^n \varepsilon_i a_i \Big) \le \exp\Big( \frac{1}{2} \sum_{i=1}^n a_i^2 \Big). \tag{2.6} $$

Our analysis also relies upon some geometric argument. Indeed, the strong
convexity of the loss function in Assumption 1 implies the 2-convexity of the risk
in the sense of [4]. This translates into a lower bound on the gain obtained when
applying Jensen's inequality to the risk function $R$.

Proposition 2. Let $(X, Y)$ be a random couple in $\mathcal{X} \times \mathbb{R}$ and $\mathcal{F} = \{f_1, \ldots, f_M\}$
be a finite dictionary in $L_2(\mathcal{X}, P_X)$ such that $|f_j(X)| \le b$, $\forall j = 1, \ldots, M$ and
$|Y| \le b$ a.s. Assume that, almost surely, the function $\ell(Y, \cdot)$ is strongly convex
with modulus of strong convexity $C_\ell$ on $[-b, b]$. Then, it holds that, for any $\theta \in \Theta$,

$$ R\Big( \sum_{j=1}^M \theta_j f_j \Big) \le \sum_{j=1}^M \theta_j R(f_j) - \frac{C_\ell}{2} \sum_{j=1}^M \theta_j \Big\| f_j - \sum_{k=1}^M \theta_k f_k \Big\|_2^2. \tag{2.7} $$

Proof. Define the random function $\ell(\cdot) = \ell(Y, \cdot)$. By strong convexity and
[17], Theorem 6.1.2, it holds almost surely that for any $a, a'$ in $[-b, b]$,

$$ \ell(a) \ge \ell(a') + (a - a')\, \ell'(a') + \frac{C_\ell}{2} (a - a')^2, $$

for any $\ell'(a')$ in the sub-differential of $\ell$ at $a'$. Plugging $a = f_j(X)$, $a' = f_\theta(X)$,
we get almost surely

$$ \ell(Y, f_j(X)) \ge \ell(Y, f_\theta(X)) + (f_j(X) - f_\theta(X))\, \ell'(f_\theta(X)) + \frac{C_\ell}{2} [f_j(X) - f_\theta(X)]^2. $$

Now, multiplying both sides by $\theta_j$ and summing over $j$, the linear term vanishes
since $\sum_j \theta_j f_j(X) = f_\theta(X)$, and we get almost surely,

$$ \sum_j \theta_j\, \ell(Y, f_j(X)) \ge \ell(Y, f_\theta(X)) + \frac{C_\ell}{2} \sum_j \theta_j [f_j(X) - f_\theta(X)]^2. $$

To complete the proof, it remains to take the expectation. □
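
For the squared loss, which is strongly convex with modulus $C_\ell = 2$, inequality (2.7) holds with equality, which gives a convenient numerical sanity check. The sketch below verifies it when the law of $(X, Y)$ is replaced by the uniform distribution on a few sample points; this choice of distribution, the squared loss and the function names are assumptions of the example.

```python
from typing import List, Sequence


def mean(values: Sequence[float]) -> float:
    return sum(values) / len(values)


def risk(pred: Sequence[float], y: Sequence[float]) -> float:
    """Squared-loss risk under the uniform law on the sample points."""
    return mean([(yi - pi) ** 2 for yi, pi in zip(y, pred)])


def mixture(theta: Sequence[float], preds: Sequence[Sequence[float]]) -> List[float]:
    """Pointwise values of f_theta = sum_j theta_j * f_j."""
    return [sum(t * p[i] for t, p in zip(theta, preds)) for i in range(len(preds[0]))]


def variance_term(theta: Sequence[float], preds: Sequence[Sequence[float]]) -> float:
    """sum_j theta_j * ||f_j - f_theta||_2^2."""
    mix = mixture(theta, preds)
    return sum(
        t * mean([(pi - mi) ** 2 for pi, mi in zip(p, mix)])
        for t, p in zip(theta, preds)
    )


# Hypothetical values f_j(X_i) for three learners on three points, and responses Y_i.
preds = [[0.0, 1.0, -1.0], [0.5, 0.5, 0.5], [1.0, -1.0, 0.0]]
y = [0.2, -0.3, 0.9]
theta = [0.2, 0.5, 0.3]
c_ell = 2.0  # modulus of strong convexity of a -> (y - a)^2

lhs = risk(mixture(theta, preds), y)
rhs = sum(t * risk(p, y) for t, p in zip(theta, preds)) - (c_ell / 2.0) * variance_term(theta, preds)
```

For a loss with a larger curvature than its modulus somewhere on $[-b, b]$, the same computation would show a strict inequality rather than equality.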

3. PROOF OF THEOREM A 

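Let $x > 0$ and assume that Assumption 1 holds throughout this section. We
start with some notation. For any $\theta \in \Theta$, define

$$ \ell_\theta(y, x) = \ell(y, f_\theta(x)) \quad \text{and} \quad R(\theta) = \mathbb{E}\, \ell_\theta(Y, X) = \mathbb{E}\, \ell(Y, f_\theta(X)), $$

where we recall that $f_\theta = \sum_{j=1}^M \theta_j f_j$ for any $\theta \in \mathbb{R}^M$. Let $0 < \nu < 1$. Let
$(e_1, \ldots, e_M)$ be the canonical basis of $\mathbb{R}^M$ and for any $\theta \in \mathbb{R}^M$ define

$$ \tilde\ell_\theta(y, x) = (1 - \nu)\, \ell_\theta(y, x) + \nu \sum_{j=1}^M \theta_j\, \ell_{e_j}(y, x) \quad \text{and} \quad \tilde R(\theta) = \mathbb{E}\, \tilde\ell_\theta(Y, X). $$

We also consider the functions

$$ \theta \in \mathbb{R}^M \mapsto K(\theta) = \sum_{j=1}^M \theta_j \log\Big(\frac{1}{\pi_j}\Big) \quad \text{and} \quad \theta \in \mathbb{R}^M \mapsto V(\theta) = \sum_{j=1}^M \theta_j \|f_j - f_\theta\|_2^2. $$

Let $\mu > 0$. Consider any oracle $\theta^* \in \Theta$ such that

$$ \theta^* \in \operatorname*{argmin}_{\theta \in \Theta} \Big( \tilde R(\theta) + \mu V(\theta) + \frac{\beta}{n} K(\theta) \Big). $$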
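We start with a geometrical aspect of the problem. The following inequality
follows from the strong convexity of the loss function $\ell$.

Proposition 3. For any $\theta \in \Theta$,

$$ \tilde R(\theta) - \tilde R(\theta^*) \ge \mu\big(V(\theta^*) - V(\theta)\big) + \frac{\beta}{n}\big(K(\theta^*) - K(\theta)\big) + \Big( \frac{(1 - \nu) C_\ell}{2} - \mu \Big) \|f_\theta - f_{\theta^*}\|_2^2. $$

Proof. Since $\theta^*$ is a minimizer of the (finite) convex function $\theta \mapsto H(\theta) =
\tilde R(\theta) + \mu V(\theta) + (\beta/n) K(\theta)$ over the convex set $\Theta$, there exists a subgradient
$\nabla H(\theta^*)$ such that for any $\theta \in \Theta$ it holds $\langle \nabla H(\theta^*), \theta - \theta^* \rangle \ge 0$. It yields

$$ \begin{aligned} \langle \nabla \tilde R(\theta^*), \theta - \theta^* \rangle &\ge \mu \langle \nabla V(\theta^*), \theta^* - \theta \rangle + (\beta/n) \langle \nabla K(\theta^*), \theta^* - \theta \rangle \\ &= \mu\big(V(\theta^*) - V(\theta)\big) - \mu \|f_\theta - f_{\theta^*}\|_2^2 + (\beta/n)\big(K(\theta^*) - K(\theta)\big). \end{aligned} \tag{3.8} $$

It follows from the strong convexity of $\ell(y, \cdot)$ that

$$ \begin{aligned} \tilde R(\theta) - \tilde R(\theta^*) &\ge \langle \nabla \tilde R(\theta^*), \theta - \theta^* \rangle + \frac{(1 - \nu) C_\ell}{2} \|f_\theta - f_{\theta^*}\|_2^2 \\ &\ge \mu\big(V(\theta^*) - V(\theta)\big) + \frac{\beta}{n}\big(K(\theta^*) - K(\theta)\big) + \Big( \frac{(1 - \nu) C_\ell}{2} - \mu \Big) \|f_\theta - f_{\theta^*}\|_2^2, \end{aligned} $$

where the second inequality follows from the previous display. □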

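Let $H$ be the $M \times M$ matrix with entries $H_{j,k} = \|f_j - f_k\|_2^2$ for all $1 \le j, k \le M$.
Let $s$ and $x$ be positive numbers and consider the random variable

$$ Z_n = (P - P_n)(\tilde\ell_{\hat\theta} - \tilde\ell_{\theta^*}) - \mu \sum_{j=1}^M \hat\theta_j \|f_j - f_{\theta^*}\|_2^2 - \mu\, \hat\theta^\top H \theta^* - \frac{1}{s} K(\hat\theta). $$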

Proposition 4. Assume that $10\mu \le \min(1 - \nu, \nu)\, C_\ell$ and $\beta \ge 3n/s$. Then, it
holds

$$ R(\hat\theta) \le \min_{1 \le j \le M} \Big[ R(e_j) + \frac{\beta}{n} \log\Big(\frac{1}{\pi_j}\Big) \Big] + 2 Z_n. $$



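Proof. First note that the following equalities hold:

$$ \sum_{j=1}^M \hat\theta_j \|f_j - f_{\theta^*}\|_2^2 = V(\hat\theta) + \|f_{\hat\theta} - f_{\theta^*}\|_2^2 \tag{3.9} $$

and

$$ \hat\theta^\top H \theta^* = V(\hat\theta) + V(\theta^*) + \|f_{\hat\theta} - f_{\theta^*}\|_2^2. \tag{3.10} $$

It follows from the definition of $\hat\theta$ that

$$ \tilde R(\hat\theta) - \tilde R(\theta^*) \le (P - P_n)(\tilde\ell_{\hat\theta} - \tilde\ell_{\theta^*}) + \frac{\beta}{n}\big(K(\theta^*) - K(\hat\theta)\big). \tag{3.11} $$

It follows from (3.9) and (3.10) in (3.11) that

$$ \tilde R(\hat\theta) - \tilde R(\theta^*) \le 2\mu V(\hat\theta) + \mu V(\theta^*) + 2\mu \|f_{\hat\theta} - f_{\theta^*}\|_2^2 + \frac{1}{s} K(\hat\theta) + \frac{\beta}{n}\big(K(\theta^*) - K(\hat\theta)\big) + Z_n. \tag{3.12} $$

Together with Proposition 3, it yields

$$ \Big( \frac{(1 - \nu) C_\ell}{2} - 3\mu \Big) \|f_{\hat\theta} - f_{\theta^*}\|_2^2 \le 3\mu V(\hat\theta) + \frac{1}{s} K(\hat\theta) + Z_n. $$

We plug the above inequality into (3.12) to obtain

$$ \begin{aligned} \tilde R(\hat\theta) - \tilde R(\theta^*) \le{} & \Big( 1 + \frac{4\mu}{(1 - \nu) C_\ell - 6\mu} \Big) \Big( \frac{1}{s} K(\hat\theta) + Z_n \Big) \\ & + \frac{\beta}{n}\big(K(\theta^*) - K(\hat\theta)\big) + \mu V(\theta^*) + \Big( 2\mu + \frac{12\mu^2}{(1 - \nu) C_\ell - 6\mu} \Big) V(\hat\theta). \end{aligned} \tag{3.13} $$

Thanks to the 2-convexity of the risk (cf. Proposition 2), we have $\tilde R(\hat\theta) \ge R(\hat\theta) +
\nu (C_\ell/2) V(\hat\theta)$. Therefore, it follows from (3.13) that

$$ \begin{aligned} R(\hat\theta) \le{} & \tilde R(\theta^*) + \mu V(\theta^*) + \frac{\beta}{n} K(\theta^*) + \Big( 1 + \frac{4\mu}{(1 - \nu) C_\ell - 6\mu} \Big) Z_n \\ & + \Big( \frac{1}{s} + \frac{4\mu}{s((1 - \nu) C_\ell - 6\mu)} - \frac{\beta}{n} \Big) K(\hat\theta) + \Big( 2\mu + \frac{12\mu^2}{(1 - \nu) C_\ell - 6\mu} - \frac{\nu C_\ell}{2} \Big) V(\hat\theta). \end{aligned} \tag{3.14} $$

Note now that $10\mu \le \min(\nu, 1 - \nu)\, C_\ell$ implies that

$$ \frac{4\mu}{(1 - \nu) C_\ell - 6\mu} \le 1 \quad \text{and} \quad 2\mu + \frac{12\mu^2}{(1 - \nu) C_\ell - 6\mu} \le \frac{\nu C_\ell}{2}. $$

Moreover, together, the two conditions of the proposition yield

$$ \frac{1}{s} + \frac{4\mu}{s((1 - \nu) C_\ell - 6\mu)} - \frac{\beta}{n} \le 0. $$

Therefore, it follows from the above three displays that

$$ \begin{aligned} R(\hat\theta) &\le \min_{\theta \in \Theta} \Big[ \tilde R(\theta) + \mu V(\theta) + \frac{\beta}{n} K(\theta) \Big] + 2 Z_n \\ &\le \min_{j=1,\ldots,M} \Big[ R(e_j) + \frac{\beta}{n} \log\Big(\frac{1}{\pi_j}\Big) \Big] + 2 Z_n. \end{aligned} $$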
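
The two algebraic identities used at the start of this proof, namely $\sum_j \hat\theta_j \|f_j - f_{\theta^*}\|_2^2 = V(\hat\theta) + \|f_{\hat\theta} - f_{\theta^*}\|_2^2$ for (3.9) and $\hat\theta^\top H \theta^* = V(\hat\theta) + V(\theta^*) + \|f_{\hat\theta} - f_{\theta^*}\|_2^2$ for (3.10), hold for any pair of points of the simplex and any distribution of $X$. The following sketch checks them numerically, with the $L_2$ norm taken under the uniform law on a few points; the data and function names are assumptions of the example.

```python
from typing import List, Sequence


def sq_dist(u: Sequence[float], v: Sequence[float]) -> float:
    """||u - v||_2^2 under the uniform law on the sample points."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)


def mixture(theta: Sequence[float], preds: Sequence[Sequence[float]]) -> List[float]:
    """Pointwise values of f_theta = sum_j theta_j * f_j."""
    return [sum(t * p[i] for t, p in zip(theta, preds)) for i in range(len(preds[0]))]


def v_term(theta: Sequence[float], preds: Sequence[Sequence[float]]) -> float:
    """V(theta) = sum_j theta_j * ||f_j - f_theta||_2^2."""
    mix = mixture(theta, preds)
    return sum(t * sq_dist(p, mix) for t, p in zip(theta, preds))


def bilinear_h(theta, theta_star, preds) -> float:
    """theta^T H theta_star with H[j][k] = ||f_j - f_k||_2^2."""
    return sum(
        tj * tk * sq_dist(pj, pk)
        for tj, pj in zip(theta, preds)
        for tk, pk in zip(theta_star, preds)
    )


# Hypothetical values f_j(X_i) and two points of the simplex.
preds = [[0.0, 1.0, -1.0, 0.3], [0.5, 0.5, 0.5, -0.2], [1.0, -1.0, 0.0, 0.7]]
theta_hat = [0.1, 0.6, 0.3]
theta_star = [0.5, 0.25, 0.25]

gap = sq_dist(mixture(theta_hat, preds), mixture(theta_star, preds))
lhs_39 = sum(t * sq_dist(p, mixture(theta_star, preds)) for t, p in zip(theta_hat, preds))
rhs_39 = v_term(theta_hat, preds) + gap
lhs_310 = bilinear_h(theta_hat, theta_star, preds)
rhs_310 = v_term(theta_hat, preds) + v_term(theta_star, preds) + gap
```

Both identities are bias-variance decompositions: the cross terms cancel because $f_{\hat\theta}$ is the $\hat\theta$-weighted mean of the $f_j$.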



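To complete our proof, it remains to prove that $\mathbb{P}[Z_n > (\beta x)/n] \le \exp(-x)$ and
$\mathbb{E}[Z_n] \le 0$ under suitable conditions on $\mu$ and $\beta$. Using a Chernoff
bound and Jensen's inequality respectively, it is easy to see that both conditions
follow if we prove that $\mathbb{E} \exp(n Z_n / \beta) \le 1$. The excess loss
decomposition

$$ \tilde\ell_{\hat\theta}(y, x) - \tilde\ell_{\theta^*}(y, x) = (1 - \nu)\big(\ell_{\hat\theta}(y, x) - \ell_{\theta^*}(y, x)\big) + \nu \sum_{j=1}^M (\hat\theta_j - \theta^*_j)\, \ell_{e_j}(y, x) $$

and the Cauchy-Schwarz inequality imply that it is enough to prove that

$$ \mathbb{E} \exp\Big[ s \Big( (1 - \nu)(P - P_n)(\ell_{\hat\theta} - \ell_{\theta^*}) - \mu \sum_{j=1}^M \hat\theta_j \|f_j - f_{\theta^*}\|_2^2 - \frac{1}{s} K(\hat\theta) \Big) \Big] \le 1 \tag{3.15} $$

and

$$ \mathbb{E} \exp\Big[ s \Big( \nu (P - P_n)\Big( \sum_{j=1}^M (\hat\theta_j - \theta^*_j)\, \ell_{e_j} \Big) - \mu\, \hat\theta^\top H \theta^* - \frac{1}{s} K(\hat\theta) \Big) \Big] \le 1 \tag{3.16} $$

for some $s \ge 2n/\beta$. Let $s$ be as such in the rest of the proof.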
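
The reduction to a bound on the Laplace transform can be spelled out as follows. This is only a sketch of the two standard steps invoked above (Markov's inequality applied to the exponential, then Jensen's inequality), assuming $\mathbb{E}\exp(n Z_n/\beta) \le 1$ has been established.

```latex
% Chernoff bound: for any x > 0, by Markov's inequality,
\mathbb{P}\big[ Z_n > \beta x / n \big]
  = \mathbb{P}\big[ \exp(n Z_n / \beta) > e^{x} \big]
  \le e^{-x}\, \mathbb{E} \exp(n Z_n / \beta)
  \le e^{-x}.

% Jensen's inequality for the convex function t -> exp(t):
\exp\Big( \frac{n}{\beta}\, \mathbb{E}[Z_n] \Big)
  \le \mathbb{E} \exp(n Z_n / \beta) \le 1
  \quad \Longrightarrow \quad \mathbb{E}[Z_n] \le 0.
```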

We begin by proving (3.15). To that end, define the symmetrized empirical
process by $h \mapsto P_{n,\varepsilon} h = n^{-1} \sum_{i=1}^n \varepsilon_i h(Y_i, X_i)$ where $\varepsilon_1, \ldots, \varepsilon_n$ are $n$ i.i.d.
Rademacher random variables independent of the $(X_i, Y_i)$'s. Moreover, take $s$ and
$\mu$ such that

$$ s \le \frac{\mu n}{[2 C_b (1 - \nu)]^2}. \tag{3.17} $$
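It yields

\begin{align}
& \mathbb{E} \exp\Big[ s \Big( (1 - \nu)(P - P_n)(\ell_{\hat\theta} - \ell_{\theta^*}) - \mu \sum_{j=1}^M \hat\theta_j \|f_j - f_{\theta^*}\|_2^2 - \frac{1}{s} K(\hat\theta) \Big) \Big] \notag \\
& \quad \le \mathbb{E} \exp\Big[ s \max_{\theta \in \Theta} \Big( (1 - \nu)(P - P_n)(\ell_\theta - \ell_{\theta^*}) - \mu \sum_{j=1}^M \theta_j \|f_j - f_{\theta^*}\|_2^2 - \frac{1}{s} K(\theta) \Big) \Big] \notag \\
& \quad \le \mathbb{E} \exp\Big[ s \max_{\theta \in \Theta} \Big( 2(1 - \nu) P_{n,\varepsilon}(\ell_\theta - \ell_{\theta^*}) - \mu \sum_{j=1}^M \theta_j \|f_j - f_{\theta^*}\|_2^2 - \frac{1}{s} K(\theta) \Big) \Big] \tag{3.18} \\
& \quad \le \mathbb{E} \exp\Big[ s \max_{\theta \in \Theta} \Big( 2 C_b (1 - \nu) P_{n,\varepsilon}(f_\theta - f_{\theta^*}) - \mu \sum_{j=1}^M \theta_j \|f_j - f_{\theta^*}\|_2^2 - \frac{1}{s} K(\theta) \Big) \Big] \tag{3.19}
\end{align}

where (3.18) follows from the symmetrization inequality [20, Theorem 2.1] and
(3.19) follows from the contraction principle [25, Theorem 4.12] applied to the
contractions $\varphi_i(t_i) = C_b^{-1} [\ell(Y_i, f_{\theta^*}(X_i) - t_i) - \ell(Y_i, f_{\theta^*}(X_i))]$, where $T \subset \mathbb{R}^n$ is defined
by $T = \{t \in \mathbb{R}^n : t_i = f_{\theta^*}(X_i) - f_\theta(X_i), \ \theta \in \Theta\}$. Next, using the fact that the
maximum of a linear function over a polytope is attained at a vertex, we get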



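\begin{align}
& \mathbb{E} \exp\Big[ s \Big( (1 - \nu)(P - P_n)(\ell_{\hat\theta} - \ell_{\theta^*}) - \mu \sum_{j=1}^M \hat\theta_j \|f_j - f_{\theta^*}\|_2^2 - \frac{1}{s} K(\hat\theta) \Big) \Big] \notag \\
& \quad \le \mathbb{E} \sum_{k=1}^M \pi_k\, \mathbb{E}_\varepsilon \exp\Big[ s \Big( 2 C_b (1 - \nu) P_{n,\varepsilon}(f_k - f_{\theta^*}) - \mu \|f_k - f_{\theta^*}\|_2^2 \Big) \Big] \notag \\
& \quad \le \sum_{k=1}^M \pi_k\, \mathbb{E} \exp\Big[ \frac{[2 C_b (1 - \nu) s]^2}{2n} P_n (f_k - f_{\theta^*})^2 - s \mu P (f_k - f_{\theta^*})^2 \Big] \tag{3.20} \\
& \quad \le \sum_{k=1}^M \pi_k\, \mathbb{E} \exp\Big[ \frac{[2 C_b (1 - \nu) s]^2}{2n} \Big( (P_n - P)(f_k - f_{\theta^*})^2 - \frac{1}{4b^2} P (f_k - f_{\theta^*})^4 \Big) \Big] \tag{3.21}
\end{align}

where (3.20) follows from (2.6) and (3.21) follows from (3.17). Together with the
above display, Proposition 1 yields (3.15) as long as

$$ s < \frac{n}{2\sqrt{3}\, b\, C_b (1 - \nu)}. \tag{3.22} $$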
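We now prove (3.16). We have

\begin{align*}
& \mathbb{E} \exp\Big[ s \Big( \nu (P - P_n)\Big( \sum_{j=1}^M (\hat\theta_j - \theta^*_j)\, \ell_{e_j} \Big) - \mu\, \hat\theta^\top H \theta^* - \frac{1}{s} K(\hat\theta) \Big) \Big] \\
& \quad \le \sum_{j=1}^M \theta^*_j \sum_{k=1}^M \pi_k\, \mathbb{E} \exp\Big[ s \Big( \nu (P - P_n)(\ell_{e_k} - \ell_{e_j}) - \mu \|f_j - f_k\|_2^2 \Big) \Big] \\
& \quad \le \sum_{j=1}^M \sum_{k=1}^M \theta^*_j \pi_k\, \mathbb{E} \exp\Big[ s \nu \Big( (P - P_n)(\ell_{e_k} - \ell_{e_j}) - \frac{\mu}{\nu C_b^2} P (\ell_{e_k} - \ell_{e_j})^2 \Big) \Big] \\
& \quad \le 1,
\end{align*}

where the last inequality follows from Proposition 1 when

$$ s < \frac{2 \mu n}{C_b\, \nu (\nu C_b + 4 \mu b)}. \tag{3.23} $$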



It is now straightforward to see that the conditions of Proposition 4, together with
(3.17), (3.22) and (3.23), are fulfilled when

$$ s = \frac{3n}{\beta}, \qquad \mu = \min(\nu, 1 - \nu)\, \frac{C_\ell}{10} $$

and

$$ \beta > \max\Big( \frac{12\, C_b^2 (1 - \nu)^2}{\mu}, \ 6\sqrt{3}\, b\, C_b (1 - \nu), \ \frac{3\, C_b\, \nu (\nu C_b + 4 \mu b)}{2\mu} \Big). $$